This section is made up of several parts covering geographic analysis of the tweet data.

Content

  1. Importing Libraries
  2. Loading Data
  3. Preparing Data
  4. Data Visualization
  5. Tweet Text Analysis
  6. Simple sentiment analysis

Importing Libraries

Loading Data

Removing Missing Values

Preparing Data (Countries of users Tweet)

Valid Tweets

Data Visualization

Top 10 Countries with Most Tweets

10 Countries with Least Tweets

Min and Max Dates Between The Dataset

Top 15 Countries with Most Tweets (Different Representation)

Let's look at the percentage of NaNs in every column. We will visualize only the columns with at least one missing value.
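As a minimal sketch of this step, the share of missing values per column can be computed with pandas and filtered to the columns that actually contain gaps. The sample dataframe and its column names are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Hypothetical sample standing in for the tweets dataframe.
df = pd.DataFrame({
    "user_name": ["a", "b", "c", "d"],
    "user_location": ["Delhi", None, None, "London"],
    "hashtags": [None, "['covid19']", None, None],
})

# Share of missing values per column, expressed as a percentage.
nan_percent = df.isna().mean() * 100

# Keep only the columns with at least one missing value, largest share first.
nan_percent = nan_percent[nan_percent > 0].sort_values(ascending=False)
print(nan_percent)
```

The resulting Series can be passed straight to a bar plot, one bar per incomplete column.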

Let's see the top 40 users by number of tweets.
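One way to get such a ranking is `value_counts`, which counts rows per user and sorts them in descending order. The column name `user_name` and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample: one row per tweet.
df = pd.DataFrame({
    "user_name": ["alice", "bob", "alice", "carol", "alice", "bob"],
})

top_n = 40  # number of users to keep

# Count tweets per user (sorted descending) and keep the first top_n.
top_users = df["user_name"].value_counts().head(top_n)
print(top_users)
```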

Let's see the most popular users.

And the most friendly users.

Let's see how the coronavirus affected the creation of new user accounts.

As the chart shows, the number of newly created Twitter accounts rose during the coronavirus period.

Let's see the top 40 most popular locations by number of tweets.

We can also draw a pie plot to get the full picture of the users' locations.
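A pie plot of locations is easier to read when the long tail is folded into a single slice. The sketch below prepares such data; the `user_location` column name, the sample values, and the "Other" bucket are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample of user locations, one row per tweet.
df = pd.DataFrame({
    "user_location": ["London"] * 4 + ["Delhi"] * 3 + ["NYC"] * 2
                     + ["Paris", "Rome"],
})

top_n = 3  # slices to show individually (assumed value)

counts = df["user_location"].value_counts()
top = counts.head(top_n)

# Fold every remaining location into one "Other" slice.
other_total = counts.iloc[top_n:].sum()
pie_data = pd.concat([top, pd.Series({"Other": other_total})])

# Percentage share of each slice.
shares = pie_data / pie_data.sum() * 100
print(shares.round(1))
```

`pie_data` can then be handed to a plotting call such as `pie_data.plot.pie()`.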

Now it's time to check the last categorical feature, source. Let's see the top 40 sources by number of tweets.

3. Additional features analysis

Let's create a new feature, hashtags_count, that shows how many hashtags the current tweet contains.
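A minimal sketch of such a feature, assuming the hashtags are counted directly from the tweet text with a regular expression (the original may instead have used the dataset's own hashtags column). The `text` column name and the sample tweets are hypothetical.

```python
import re

import pandas as pd

# A hashtag is taken to be '#' followed by word characters (an assumption).
HASHTAG_RE = re.compile(r"#\w+")

# Hypothetical sample tweets.
df = pd.DataFrame({
    "text": [
        "Stay safe #COVID19 #StayHome",
        "No tags here",
        "#lockdown day 3",
    ],
})

# Number of hashtags found in each tweet.
df["hashtags_count"] = df["text"].apply(lambda t: len(HASHTAG_RE.findall(t)))
print(df)
```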

And look at the values of the newly created column.

The distribution of the new feature is as expected: many tweets with few hashtags and few tweets with a large number of hashtags.

Now we will see the top 40 users who use hashtags a little more than others.

Split the day and time into separate columns
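Assuming the timestamp lives in a single `date` column stored as text, the split can be sketched with the pandas `.dt` accessor. The column names `day` and `time` are illustrative.

```python
import pandas as pd

# Hypothetical sample timestamps, stored as strings as in a raw CSV.
df = pd.DataFrame({
    "date": ["2020-07-25 12:27:21", "2020-07-26 03:05:10"],
})

# Parse once, then derive the calendar day and the time of day.
df["date"] = pd.to_datetime(df["date"])
df["day"] = df["date"].dt.date
df["time"] = df["date"].dt.time
print(df)
```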

Number of unique users per day

Now we are going to check how many tweets there were for every day in our dataset.

Let's do the same for hours.
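The daily and hourly aggregations described above can be sketched with `groupby` on parts of the timestamp. The column names `user_name` and `date` and the sample rows are assumptions.

```python
import pandas as pd

# Hypothetical sample: one row per tweet.
df = pd.DataFrame({
    "user_name": ["alice", "alice", "bob", "carol", "carol"],
    "date": pd.to_datetime([
        "2020-07-25 09:10:00",
        "2020-07-25 09:45:00",
        "2020-07-25 21:00:00",
        "2020-07-26 09:30:00",
        "2020-07-26 10:00:00",
    ]),
})

day = df["date"].dt.date
hour = df["date"].dt.hour

# Distinct users who tweeted on each day.
users_per_day = df.groupby(day)["user_name"].nunique()

# Number of tweets per day and per hour of the day.
tweets_per_day = df.groupby(day).size()
tweets_per_hour = df.groupby(hour).size()

print(users_per_day, tweets_per_day, tweets_per_hour, sep="\n\n")
```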

Let's split the hashtags into a separate column.

And show the top 20 hashtags in tweets.
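A sketch of both steps, assuming hashtags are extracted from the tweet text and compared case-insensitively; the `hashtag_list` column name and the sample tweets are hypothetical.

```python
import pandas as pd

# Hypothetical sample tweets.
df = pd.DataFrame({
    "text": [
        "#COVID19 #StayHome",
        "#covid19 update",
        "nothing tagged",
        "#Covid19 #lockdown #stayhome",
    ],
})

# One list of hashtags (without the '#') per tweet.
df["hashtag_list"] = df["text"].str.findall(r"#(\w+)")

# One row per hashtag; tweets without hashtags become NaN and are dropped.
hashtags = df["hashtag_list"].explode().dropna().str.lower()

top_hashtags = hashtags.value_counts().head(20)
print(top_hashtags)
```

Lowercasing before counting merges variants such as `#COVID19` and `#covid19`; drop that step if case should be preserved.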

Now we are going to calculate the length of every tweet in the dataset.

Tweets text analysis

Here we are going to examine the text feature of the dataset.

Let's see a general word cloud for this column.

Let's see word clouds for the top 3 users.

Let's also visualize a word cloud of the users' descriptions.

Sentiment analysis

Using TfidfVectorizer to extract features and the K-means clustering algorithm to split the data into 3 clusters.

We can see that clusters 0 and 1 contain more or less positive tweets, while cluster 2 contains tweets with information about new cases, reports and regions.

Sentiment Analysis of COVID-19 Tweets using TextBlob

Assigning entities in a list

Dividing the origin of all tweets

Sentiment of Tweets
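TextBlob reports a polarity score between -1 and 1 for a piece of text. The sketch below maps that score to three labels, assuming a cut-off at zero (the thresholds used in the original analysis are not stated). Both helper names are hypothetical, and TextBlob is imported lazily so the labelling logic can be tried without it.

```python
def polarity_to_label(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label.

    The zero cut-off is an assumption made for this sketch.
    """
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def textblob_polarity(text: str) -> float:
    """Return TextBlob's polarity for a text (requires `pip install textblob`)."""
    from textblob import TextBlob  # imported here so the rest runs without it

    return TextBlob(text).sentiment.polarity


# Labelling from precomputed scores; with TextBlob installed one would call
# polarity_to_label(textblob_polarity(tweet_text)) for each tweet.
print([polarity_to_label(p) for p in (0.8, 0.0, -0.4)])
```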

The tweets have already been labelled as positive, negative, or neutral. Here the code displays this information as a pie chart, using the explode option to offset the slices (Figures 27 and 28). The tweets are classified into the three classes as 94.6% neutral, .6% negative, and .2% positive.

Animation with geographical distribution of tweets

Here I am going to show an approach that uses a Plotly world map to show the geographical distribution of tweets.
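One possible shape for this approach: aggregate tweets per day and country, then hand the table to Plotly Express with the date as the animation frame. The column names, the sample rows, the helper `build_animated_map`, and the choice of a choropleth (rather than, say, a scatter map) are all assumptions.

```python
import pandas as pd

# Hypothetical sample: one row per tweet with a resolved country name.
df = pd.DataFrame({
    "date": ["2020-07-25", "2020-07-25", "2020-07-25",
             "2020-07-26", "2020-07-26"],
    "country": ["India", "India", "United States",
                "India", "United Kingdom"],
})

# Tweets per (date, country); each date becomes one animation frame.
geo_counts = (
    df.groupby(["date", "country"])
      .size()
      .reset_index(name="tweets")
      .sort_values("date")
)


def build_animated_map(counts: pd.DataFrame):
    """Build an animated world map (requires `pip install plotly`)."""
    import plotly.express as px  # imported lazily; only needed for plotting

    return px.choropleth(
        counts,
        locations="country",
        locationmode="country names",
        color="tweets",
        animation_frame="date",
    )


print(geo_counts)
# fig = build_animated_map(geo_counts); fig.show()
```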

Build a module for text standardization
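A minimal sketch of what such a module might contain, using only the standard library. The function name `standardize_text` and the specific cleaning rules (lowercasing, dropping URLs and mentions, keeping hashtag words, removing punctuation) are assumptions; the original module may apply different steps.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")
WHITESPACE_RE = re.compile(r"\s+")


def standardize_text(text: str) -> str:
    """Normalize a tweet for downstream text analysis."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # drop links
    text = MENTION_RE.sub(" ", text)    # drop @mentions
    text = text.replace("#", " ")       # keep the hashtag word, drop the symbol
    text = NON_ALNUM_RE.sub(" ", text)  # drop punctuation and other symbols
    return WHITESPACE_RE.sub(" ", text).strip()


print(standardize_text(
    "Stay SAFE!! @WHO says #COVID19 cases rise https://t.co/abc123"
))
```

Applied to a dataframe, this would typically be used as `df["text"].apply(standardize_text)`.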

Get covid daily cases

APIs for COVID-19 case statistics per country

Get covid cases dataframe for a specific country

Preprocessing covid cases dataframe

Create a new column for the date with year

Show the beginning and end of dates

Get covid vaccinations

Get covid vaccinations for a specific country

Show the beginning and end of dates

Merging covid cases and vaccinations dataframes
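A sketch of the merge, assuming both dataframes share a `date` column and that every case date should be kept even when no vaccination figure exists yet (a left join). The column names, sample values, and the choice to fill missing vaccinations with zero are assumptions.

```python
import pandas as pd

# Hypothetical daily case counts for one country.
cases = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "new_cases": [120, 150, 90],
})

# Hypothetical vaccination figures, starting one day later.
vaccinations = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-02", "2021-01-03"]),
    "daily_vaccinations": [1500, 2100],
})

# Keep every case date; dates without vaccination data get NaN.
merged = cases.merge(vaccinations, on="date", how="left")

# Treat days before the vaccination data starts as zero doses.
merged["daily_vaccinations"] = merged["daily_vaccinations"].fillna(0)
print(merged)
```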

Create a dataframe for sentiments counts
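One way to build such a table is to count tweets per date and sentiment label and pivot the labels into columns. The column names `date` and `sentiment`, the three label strings, and the sample rows are assumptions.

```python
import pandas as pd

# Hypothetical labelled tweets.
tweets = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-01", "2021-03-01",
             "2021-03-02", "2021-03-02"],
    "sentiment": ["positive", "neutral", "neutral",
                  "negative", "positive"],
})

labels = ["negative", "neutral", "positive"]

sentiment_counts = (
    tweets.groupby(["date", "sentiment"])
          .size()
          .unstack(fill_value=0)               # one column per label
          .reindex(columns=labels, fill_value=0)  # guarantee all three columns
          .reset_index()
)
sentiment_counts.columns.name = None  # drop the leftover "sentiment" axis name
print(sentiment_counts)
```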

Merging all dataframes for a specific country (sentiment counts, covid cases and vaccination stats)

Filter dataframe to have only rows whose date is in tweets dataframe
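The filter can be sketched with `isin`, keeping only the rows whose date also occurs in the tweets dataframe. The dataframe names, column names, and sample values are hypothetical.

```python
import pandas as pd

# Hypothetical merged statistics covering four days.
stats = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-03-01", "2021-03-02", "2021-03-03", "2021-03-04"]
    ),
    "new_cases": [100, 120, 130, 90],
})

# Hypothetical tweets covering only two of those days.
tweets = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-02", "2021-03-02", "2021-03-04"]),
})

# Keep only the rows whose date appears among the tweet dates.
filtered = stats[stats["date"].isin(tweets["date"].unique())]
filtered = filtered.reset_index(drop=True)
print(filtered)
```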

Scaling numerical columns for data visualization
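Series with very different magnitudes (case counts versus tweet counts) are hard to compare on one axis, so they are usually rescaled first. The sketch below assumes min-max scaling to the range [0, 1]; the original may have used a different scaler, and the column names are illustrative.

```python
import pandas as pd

# Hypothetical merged dataframe with columns on very different scales.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-03"]),
    "new_cases": [100, 200, 300],
    "positive": [5, 10, 25],
})

numeric_cols = df.select_dtypes("number").columns

# Min-max scaling: (x - min) / (max - min), applied column by column.
# A constant column would divide by zero and produce NaN.
col_min = df[numeric_cols].min()
col_range = df[numeric_cols].max() - col_min

scaled = df.copy()
scaled[numeric_cols] = (df[numeric_cols] - col_min) / col_range
print(scaled)
```

After scaling, every numeric column shares the same range, so the lines can be drawn on a single axis.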

Draw a multi-line plot